{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "audio.ipynb",
      "provenance": [],
      "private_outputs": true,
      "collapsed_sections": [
        "Tce3stUlHN0L"
      ],
      "toc_visible": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Tce3stUlHN0L"
      },
      "source": [
        "##### Copyright 2020 The TensorFlow IO Authors."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "tuOe1ymfHZPu",
        "colab": {}
      },
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "qFdPvlXBOdUN"
      },
      "source": [
        "# Audio Data Preparation and Augmentation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "MfBg1C5NB3X0"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://www.tensorflow.org/io/tutorials/audio\"><img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/audio.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/io/blob/master/docs/tutorials/audio.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
        "  </td>\n",
        "      <td>\n",
        "    <a href=\"https://raw.githubusercontent.com/tensorflow/io/master/docs/tutorials/audio.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "xHxb-dlhMIzW"
      },
      "source": [
        "## Overview\n",
        "\n",
        "One of the biggest challanges in Automatic Speech Recognition is the preparation and augmentation of audio data. Audio data analysis could be in time or frequency domain, which adds additional complex compared with other data sources such as images.\n",
        "\n",
        "As a part of the TensorFlow ecosystem, `tensorflow-io` package provides quite a few useful audio-related APIs that helps easing the preparation and augmentation of audio data."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "MUXex9ctTuDB"
      },
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "upgCc3gXybsA",
        "colab_type": "text"
      },
      "source": [
        "### Install required Packages, and restart runtime"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "uUDYyMZRfkX4",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "!pip install tensorflow-io-nightly"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "J0ZKhA6s0Pjp",
        "colab_type": "text"
      },
      "source": [
        "## Usage"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yZmI7l_GykcW",
        "colab_type": "text"
      },
      "source": [
        "### Read an Audio File\n",
        "\n",
        "In TensorFlow IO, class `tfio.audio.AudioIOTensor` allows you to read an audio file into a lazy-loaded `IOTensor`:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nS3eTBvjt-O5",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import tensorflow as tf\n",
        "import tensorflow_io as tfio\n",
        "\n",
        "audio = tfio.audio.AudioIOTensor('gs://cloud-samples-tests/speech/brooklyn.flac')\n",
        "\n",
        "print(audio)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z9GCyPWNuOm7",
        "colab_type": "text"
      },
      "source": [
        "In the above example, the Flac file `brooklyn.flac` is from a publicly accessible audio clip in [google cloud](https://cloud.google.com/speech-to-text/docs/quickstart-gcloud).\n",
        "\n",
        "The GCS address `gs://cloud-samples-tests/speech/brooklyn.flac` are used directly because GCS is a supported file system in TensorFlow. In addition to `Flac` format, `WAV`, `Ogg`, `MP3`, and `MP4A` are also supported by `AudioIOTensor` with automatic file format detection.\n",
        "\n",
        "`AudioIOTensor` is lazy-loaded so only shape, dtype, and sample rate are shown initially. The shape of the `AudioIOTensor` is represented as `[samples, channels]`, which means the audio clip we loaded is mono channel with `28979` samples in `int16`."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IF_kYz_o2DH4",
        "colab_type": "text"
      },
      "source": [
        "The content of the audio clip will only be read as needed, either by converting `AudioIOTensor` to `Tensor` through `to_tensor()`, or though slicing. Slicing is especially useful when only a small portion of a large audio clip is needed:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "wtM_ixN724xb",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "audio_slice = audio[100:]\n",
        "\n",
        "# remove last dimension\n",
        "audio_tensor = tf.squeeze(audio_slice, axis=[-1])\n",
        "\n",
        "print(audio_tensor)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IGnbXuVnSo8T",
        "colab_type": "text"
      },
      "source": [
        "The audio can be played through:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0rLbVxuFSvVO",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from IPython.display import Audio\n",
        "\n",
        "Audio(audio_tensor.numpy(), rate=audio.rate.numpy())"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fmt4cn304IbG",
        "colab_type": "text"
      },
      "source": [
        "It is more convinient to convert tensor into float numbers and show the audio clip in graph:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZpwajOeR4UMU",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import matplotlib.pyplot as plt\n",
        "\n",
        "\n",
        "tensor = tf.cast(audio_tensor, tf.float32) / 32768.0\n",
        "\n",
        "plt.figure()\n",
        "plt.plot(tensor.numpy())"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "86qE8BPl5rcA",
        "colab_type": "text"
      },
      "source": [
        "### Trim the noise\n",
        "\n",
        "Sometimes it makes sense to trim the noise from the audio, which could be done through API `tfio.experimental.audio.trim`. Returned from the API is a pair of `[start, stop]` position of the segement:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "eEa0Z5U26Ep3",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "position = tfio.experimental.audio.trim(tensor, axis=0, epsilon=0.1)\n",
        "print(position)\n",
        "\n",
        "start = position[0]\n",
        "stop = position[1]\n",
        "print(start, stop)\n",
        "\n",
        "processed = tensor[start:stop]\n",
        "\n",
        "plt.figure()\n",
        "plt.plot(processed.numpy())"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ineBzDeu-lTh",
        "colab_type": "text"
      },
      "source": [
        "### Fade In and Fade Out\n",
        "\n",
        "One useful audio engineering technique is fade, which gradually increases or decreases audio signals. This can be done through `tfio.experimental.audio.fade`. `tfio.experimental.audio.fade` supports different shapes of fades such as `linear`, `logarithmic`, or `exponential`:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "LfZo0XaaAaeM",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "fade = tfio.experimental.audio.fade(\n",
        "    processed, fade_in=1000, fade_out=2000, mode=\"logarithmic\")\n",
        "\n",
        "plt.figure()\n",
        "plt.plot(fade.numpy())"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7rhLvOSZB0k0",
        "colab_type": "text"
      },
      "source": [
        "### Spectrogram\n",
        "\n",
        "Advanced audio processing often works on frequency changes over time. In `tensorflow-io` a waveform can be converted to spectrogram through `tfio.experimental.audio.spectrogram`:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "UyFMBK-LDDnN",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Convert to spectrogram\n",
        "spectrogram = tfio.experimental.audio.spectrogram(\n",
        "    fade, nfft=512, window=512, stride=256)\n",
        "\n",
        "plt.figure()\n",
        "plt.imshow(tf.math.log(spectrogram).numpy())"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pZ92HnbJGHBS",
        "colab_type": "text"
      },
      "source": [
        "Additional transformation to different scales are also possible:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZgyedQdxGM2y",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Convert to mel-spectrogram\n",
        "mel_spectrogram = tfio.experimental.audio.melscale(\n",
        "    spectrogram, rate=16000, mels=128, fmin=0, fmax=8000)\n",
        "\n",
        "\n",
        "plt.figure()\n",
        "plt.imshow(tf.math.log(mel_spectrogram).numpy())\n",
        "\n",
        "# Convert to db scale mel-spectrogram\n",
        "dbscale_mel_spectrogram = tfio.experimental.audio.dbscale(\n",
        "    mel_spectrogram, top_db=80)\n",
        "\n",
        "plt.figure()\n",
        "plt.imshow(dbscale_mel_spectrogram.numpy())"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nXd776xNIr_I",
        "colab_type": "text"
      },
      "source": [
        "### SpecAugment\n",
        "\n",
        "In addition to the above mentioned data preparation and augmentation APIs, `tensorflow-io` package also provides advanced spectrogram augmentations, most notably Frequency and Time Masking discussed in [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition (Park et al., 2019)](https://arxiv.org/pdf/1904.08779.pdf)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dajm7k-2J5l7",
        "colab_type": "text"
      },
      "source": [
        "#### Frequency Masking\n",
        "\n",
        "In frequency masking, frequency channels `[f0, f0 + f)` are masked where `f` is chosen from a uniform distribution from `0` to the frequency mask parameter `F`, and `f0` is chosen from `(0, ν − f)` where `ν` is the number of frequency channels."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kLEdfkkoK27A",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Freq masking\n",
        "freq_mask = tfio.experimental.audio.freq_mask(dbscale_mel_spectrogram, param=10)\n",
        "\n",
        "plt.figure()\n",
        "plt.imshow(freq_mask.numpy())"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_luycpCWLe5l",
        "colab_type": "text"
      },
      "source": [
        "#### Time Masking\n",
        "\n",
        "In time masking, `t` consecutive time steps `[t0, t0 + t)` are masked where `t` is chosen from a uniform distribution from `0` to the time mask parameter `T`, and `t0` is chosen from `[0, τ − t)` where `τ` is the time steps."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "G1ie8J3wMMEI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Time masking\n",
        "time_mask = tfio.experimental.audio.time_mask(dbscale_mel_spectrogram, param=10)\n",
        "\n",
        "plt.figure()\n",
        "plt.imshow(time_mask.numpy())"
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}